Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Frontier machine and compiler configuration to fix issues with crayclanggpu and amdclanggpu #6771

Merged
merged 1 commit into from
Nov 25, 2024

Conversation

grnydawn
Copy link
Contributor

@grnydawn grnydawn commented Nov 22, 2024

The update includes:

  • Changed Core from 24.07 to 24.00.
  • Updated CMake from 3.27.9 to 3.21.3.
  • Restored --allow-multiple-definition in CMAKE_EXE_LINKER_FLAGS.
  • Removed the craype-accel-amd-gfx90a module load for amdclanggpu_frontier.
  • Added two MOSART source files to NOOPT_FILES as a workaround to prevent a crayclanggpu build error.

Fixes #6750, Fixes #6755

Note: This PR disables OpenMP offload and OpenACC for the AMD compiler on Frontier.

Note: Optimization is disabled for the following MOSART source files:

  • mosart/src/wrm/WRM_subw_IO_mod.F90
  • mosart/src/riverroute/RtmMod.F90

[BFB] No baseline for Frontier yet.

Copy link

github-actions bot commented Nov 22, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6771/
on branch gh-pages at 2024-11-25 18:43 UTC

@grnydawn
Copy link
Contributor Author

Testing is ongoing, and the results will be posted here once it finishes. So far, the test outcomes are not worse than the previous test results; in some cases, they are better.

Please invite anyone to review this PR if you think it is necessary.

@grnydawn grnydawn self-assigned this Nov 22, 2024
@rljacob
Copy link
Member

rljacob commented Nov 22, 2024

Are you trying to get scream cases to build and run with this PR?

@dqwu
Copy link
Contributor

dqwu commented Nov 22, 2024

Are you trying to get scream cases to build and run with this PR?

This PR tries to fix some known build issues for machine frontier. No changes to machine frontier-scream-gpu.
I think it has nothing to do with scream cases so the title might needs to be updated.

@rljacob
Copy link
Member

rljacob commented Nov 22, 2024

We want to abolish frontier-scream-gpu and be able to run scream cases with the frontier machine entry. Perhaps that will wait for another PR.

@dqwu
Copy link
Contributor

dqwu commented Nov 22, 2024

@grnydawn If this PR can fix issue 6750 and 6755 (6764 might only be partially fixed for some cases), you should probably add fixes #xxxx in the description.

@grnydawn
Copy link
Contributor Author

We want to abolish frontier-scream-gpu and be able to run scream cases with the frontier machine entry. Perhaps that will wait for another PR.

@rljacob , Sure, I will work on removing frontier-scream-gpu with the Scream group after this PR.

@sarats
Copy link
Member

sarats commented Nov 22, 2024

Changed Core from 24.07 to 24.00.
Updated CMake from 3.27.9 to 3.21.3.

Just curious, why the downgrade in versions? esp. CMake?

@dqwu
Copy link
Contributor

dqwu commented Nov 22, 2024

Changed Core from 24.07 to 24.00.
Updated CMake from 3.27.9 to 3.21.3.

Just curious, why the downgrade in versions? esp. CMake?

This is required to fix issue 6750 (looks like a CMake bug to be reported to Kitware).

@sarats
Copy link
Member

sarats commented Nov 22, 2024

Also FYI (my comment in the file about -Wl,--allow-shlib-undefined). I would vote to leave it in to prevent similar issues, it's typically harmless.

# -Wl,--allow-shlib-undefined was added to address rocm 5.4.3 Fortran linker issue:
# /opt/rocm-5.4.3/lib/libhsa-runtime64.so.1: undefined reference to `std::condition_variable::wait(std::unique_lock<std::mutex>&)@GLIBCXX_3.4.30'
# AMD started building with GCC 12.2.0, which brings in a GLIBCXX symbol that isn't in CCE's default GCC toolchain.

@dqwu
Copy link
Contributor

dqwu commented Nov 22, 2024

Also FYI (my comment in the file about -Wl,--allow-shlib-undefined). I would vote to leave it in to prevent similar issues, it's typically harmless.

# -Wl,--allow-shlib-undefined was added to address rocm 5.4.3 Fortran linker issue:
# /opt/rocm-5.4.3/lib/libhsa-runtime64.so.1: undefined reference to `std::condition_variable::wait(std::unique_lock<std::mutex>&)@GLIBCXX_3.4.30'
# AMD started building with GCC 12.2.0, which brings in a GLIBCXX symbol that isn't in CCE's default GCC toolchain.

This one can be restored if your prefer to do so, though it looks optional so far.

@grnydawn
Copy link
Contributor Author

Changed Core from 24.07 to 24.00.
Updated CMake from 3.27.9 to 3.21.3.

Just curious, why the downgrade in versions? esp. CMake?

@sarats , Some combinations of particular versions of CMake and Core mode work with the Scorpio build with threading, while others do not. Please see E3SM Issue #6750 for the discussions.

@grnydawn grnydawn closed this Nov 22, 2024
@grnydawn grnydawn reopened this Nov 22, 2024
@grnydawn
Copy link
Contributor Author

@grnydawn If this PR can fix issue 6750 and 6755 (6764 might only be partially fixed for some cases), you should probably add fixes #xxxx in the description.

@dqwu , I added notes for fixing issues #6750 and #6755. Issue #6764 has not been verified yet and will be updated if it is also fixed.

@grnydawn
Copy link
Contributor Author

Also FYI (my comment in the file about -Wl,--allow-shlib-undefined). I would vote to leave it in to prevent similar issues, it's typically harmless.

# -Wl,--allow-shlib-undefined was added to address rocm 5.4.3 Fortran linker issue:
# /opt/rocm-5.4.3/lib/libhsa-runtime64.so.1: undefined reference to `std::condition_variable::wait(std::unique_lock<std::mutex>&)@GLIBCXX_3.4.30'
# AMD started building with GCC 12.2.0, which brings in a GLIBCXX symbol that isn't in CCE's default GCC toolchain.

@sarats , @dqwu , I have not observed any differences in test results, whether using the compiler flag or not. I will restore it if preferred.

@dqwu
Copy link
Contributor

dqwu commented Nov 22, 2024

@grnydawn If this PR can fix issue 6750 and 6755 (6764 might only be partially fixed for some cases), you should probably add fixes #xxxx in the description.

@dqwu , I added notes for fixing issues #6750 and #6755. Issue #6764 has not been verified yet and will be updated if it is also fixed.

@grnydawn FYI, in GitHub, the correct format to auto-close multiple issues is:
Fixes #xxxx,#yyyy

The key here is to separate the issue numbers with a comma, without any extra spaces between them.
Please update the description accordingly, otherwise issue 6755 will not be auto-closed.

You can also use Fixes #xxxx, Fixes #yyyy

@grnydawn
Copy link
Contributor Author

@grnydawn If this PR can fix issue 6750 and 6755 (6764 might only be partially fixed for some cases), you should probably add fixes #xxxx in the description.

@dqwu , I added notes for fixing issues #6750 and #6755. Issue #6764 has not been verified yet and will be updated if it is also fixed.

@grnydawn FYI, in GitHub, the correct format to auto-close multiple issues is: Fixes: #xxxx, #yyyy

The key here is to separate the issue numbers with a comma, without any extra spaces between them. Please update the description accordingly, otherwise issue 6755 will not be auto-closed.

@dqwu , I updated my PR description. Thanks!

@grnydawn grnydawn changed the title Update Frontier machine and compiler configuration to support Scream CIME merge Update Frontier machine and compiler configuration to fix issues with crayclanggpu and amdclanggpu Nov 22, 2024
Copy link
Contributor

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are fine with me

@grnydawn
Copy link
Contributor Author

Thank you all for your reviews. The test results of this PR are quite similar to the test results of the current Frontier machine/compiler configurations. The build time for CrayClang has improved, taking about half the build time on average.

Test result crayclang-this PR crayclang-Current amdclang-this PR amdclang-Current gnu-this PR gnu-Current
PASS 64 63 57 58 65 65
FAIL 8 9 15 14 7 7
Avg. Build time 597.1 1137 248.2 289 150.5 133

* Changed from Core/24.07 to Core/24.00
* Changed from cmake/3.27.9 to cmake/3.21.3
* Restored '--allow-shlib-undefined --allow-multiple-definition' in CMAKE_EXE_LINKER_FLAGS
* Removed 'craype-accel-amd-gfx90a' module load for amdclanggpu_frontier
* Added two mosart source files into NOOPT_FILES to workaround to prevent optcg crayclanggpu build error
@grnydawn grnydawn force-pushed the ykim/frontier/scream-cime-merge branch from d37d52b to 71b1a35 Compare November 25, 2024 18:41
grnydawn added a commit that referenced this pull request Nov 25, 2024
* Changed from Core/24.07 to Core/24.00
* Changed from cmake/3.27.9 to cmake/3.21.3
* Restored '--allow-shlib-undefined --allow-multiple-definition' in CMAKE_EXE_LINKER_FLAGS
* Removed 'craype-accel-amd-gfx90a' module load for amdclanggpu_frontier
* Added two mosart source files into NOOPT_FILES to workaround to prevent optcg crayclanggpu build error
@grnydawn grnydawn merged commit 3366392 into master Nov 25, 2024
9 checks passed
@grnydawn grnydawn deleted the ykim/frontier/scream-cime-merge branch November 25, 2024 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants